Sherinah Rashid
May 5, 2023

```r
# Load (and install if missing) the packages used in this exercise
pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse)
```
This Take-Home Exercise is part of the VAST Challenge 2023. The country of Oceanus has sought FishEye International’s help in identifying companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing. They hope to understand business relationships, including finding links that will help them stop IUU fishing and protect marine species that are affected by it.
FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. FishEye now wants your help to develop a new visual analytics approach to better understand fishing business anomalies.
In line with this, this page will attempt to answer the following tasks under Mini-Challenge 3 of the VAST Challenge:
Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.
Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user. Limit your response to 400 words and 5 images.
FishEye has transformed the data into an undirected multi-graph consisting of 27,622 nodes and 24,038 edges. Details of the attributes provided are listed below:
Nodes:
type – Possible node types include: {company and person}. Possible node sub types include: {beneficial owner, company contacts}.
country – Country associated with the entity. This can be a full country or a two-letter country code.
product_services – Description of product services that the “id” node does.
revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.
id – Identifier of the node, which is also the name of the entity.
role – A subset of the node “type”; not present for every node.
dataset – Always “MC3”.
Links:
type – Possible edge types include: {person}. Possible edge sub types include: {beneficial owner, company contacts}.
source – ID of the source node.
target – ID of the target node.
dataset – Always “MC3”.
Let’s first load the packages and datasets to be used.
In the code chunk below, fromJSON() of jsonlite package is used to import MC3.json into R environment. Examination of the dataset shows that it is a large list R object.
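Here is a minimal sketch of the import step. To keep it runnable without the challenge data, it writes a tiny stand-in file with the same top-level `nodes` and `links` elements and reads that instead. The stand-in records and the path mentioned in the comment are assumptions for illustration, not the real MC3.json.

```r
library(jsonlite)

# Hypothetical stand-in for MC3.json: a tiny file with the same
# top-level "nodes" and "links" elements described above.
toy_path <- tempfile(fileext = ".json")
writeLines('{
  "nodes": [
    {"id": "Company A", "type": "Company", "country": "ZH"},
    {"id": "Person B", "type": "Beneficial Owner", "country": "ZH"}
  ],
  "links": [
    {"source": "Company A", "target": "Person B", "type": "Beneficial Owner"}
  ]
}', toy_path)

# For the real data, replace toy_path with the location of MC3.json
# (for example "data/MC3.json", which is an assumed path).
mc3_data <- fromJSON(toy_path)

class(mc3_data)   # a list
names(mc3_data)   # the two top-level elements: nodes and links
```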
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
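A sketch of what this extraction could look like, using the functions named in the notes that follow. The toy `links` table is a hypothetical stand-in for `mc3_data$links`, and the column name `weights` is taken from the summary statistics shown later.

```r
library(dplyr)
library(tibble)

# Toy stand-in for mc3_data$links (the real one comes from MC3.json).
links <- data.frame(
  source = c("Company A", "Company A", "Company C", "Person D"),
  target = c("Person B",  "Person B",  "Person D",  "Person D"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Company Contacts")
)

mc3_edges <- as_tibble(links) %>%
  distinct() %>%                                  # drop exact duplicate rows
  mutate(source = as.character(source),           # list -> character in the real data
         target = as.character(target),
         type   = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%  # count rows per combination
  filter(source != target)                        # remove self-loops

mc3_edges
```

Because distinct() runs before the count, each source, target and type combination appears only once, so every weight is 1. That is consistent with the weights column in the summary statistics, where the mean is 1 and the standard deviation is 0.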
distinct() is used to ensure that there are no duplicated records.
mutate() and as.character() are used to convert the field data type from list to character.
group_by() and summarise() are used to count the number of unique links.
filter(source != target) is used to ensure that there are no records with the same source and target.

The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
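A sketch of the nodes extraction under the same assumptions. The toy `nodes` table uses list columns to imitate fields that arrive as lists; its values are made up for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for mc3_data$nodes, with list columns (assumed shape).
nodes <- tibble(
  id               = list("Company A", "Person B"),
  country          = list("ZH", "ZH"),
  type             = list("Company", "Beneficial Owner"),
  revenue_omu      = list(16210.68, NULL),          # NULL = no revenue reported
  product_services = list("Seafood products", character(0))
)

mc3_nodes <- nodes %>%
  mutate(id               = as.character(id),
         country          = as.character(country),
         type             = as.character(type),
         product_services = as.character(product_services),
         # list -> character -> numeric; entries that are not numbers
         # become NA (the coercion warning is silenced here)
         revenue_omu = suppressWarnings(
           as.numeric(as.character(revenue_omu)))) %>%
  select(id, country, type, revenue_omu, product_services)

mc3_nodes
```

One side effect worth knowing: as.character() turns an empty list element into the literal string "character(0)". If the real product_services field contains empty entries, this may be where the character(0) category in the product_services counts comes from.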
mutate() and as.character() are used to convert the field data type from list to character.
For the numeric field revenue_omu, the values are first converted with as.character(); then as.numeric() is used to convert them into numeric data type.
select() is used to re-organise the order of the fields.

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame. The report reveals that there are no missing values.
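A minimal sketch of the call, run on a hypothetical two-row stand-in for mc3_edges so that it works without the challenge data:

```r
library(skimr)
library(tibble)

# Toy stand-in for mc3_edges (values are made up).
mc3_edges <- tibble(
  source  = c("Company A", "Company C"),
  target  = c("Person B", "Person D"),
  type    = c("Beneficial Owner", "Company Contacts"),
  weights = c(1L, 1L)
)

# One summary row per column, grouped by column type.
edge_summary <- skim(mc3_edges)
edge_summary
```

The n_missing and complete_rate columns of the result are the quickest way to check for missing values.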
| | |
|---|---|
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table.
Let’s plot a bar graph to show the type of edges. As we can see from the barchart below, there are about 16,000 edges for beneficial owner, and about 7,500 edges for company contacts.
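A sketch of how such a bar chart can be drawn with ggplot2. The three-row mc3_edges table is a hypothetical stand-in, and the axis labels are my own choice.

```r
library(ggplot2)

# Toy stand-in for mc3_edges; only the type column matters here.
mc3_edges <- data.frame(
  type = c("Beneficial Owner", "Beneficial Owner", "Company Contacts")
)

# geom_bar() counts the rows in each category of the x variable.
p_edges <- ggplot(mc3_edges, aes(x = type)) +
  geom_bar() +
  labs(x = "Edge type", y = "Number of edges")

p_edges
```

The same pattern works for node types by swapping in the nodes table.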
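Similarly, skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame. The report reveals that the character fields have no missing values, but revenue_omu is missing for 21,515 of the 27,622 nodes (a complete rate of 0.22).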
| | |
|---|---|
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table.
Let’s plot a bar graph to show the type of nodes. As we can see from the bar chart below, there are about 12,000 nodes for beneficial owners, 8,750 nodes for company, and 7,000 nodes for company contacts.
Instead of using the nodes data table extracted from the original dataset, we will prepare a new nodes data table by using the source and target fields of the mc3_edges table. This is necessary to ensure that the nodes data table includes all the source and target values.
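One way to sketch this step: stack the source and target columns into a single id column, de-duplicate, and join the known attributes back on. The toy tables and the name mc3_nodes1 are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy stand-ins (values are made up).
mc3_edges <- tibble(
  source = c("Company A", "Company C"),
  target = c("Person B", "Person D"),
  type   = c("Beneficial Owner", "Company Contacts")
)
mc3_nodes <- tibble(
  id      = c("Company A", "Person B"),
  country = c("ZH", "ZH"),
  type    = c("Company", "Beneficial Owner")
)

mc3_nodes1 <- bind_rows(
    mc3_edges %>% select(id = source),   # every id that appears as a source
    mc3_edges %>% select(id = target)    # every id that appears as a target
  ) %>%
  distinct() %>%
  left_join(mc3_nodes, by = "id")        # attach attributes where known

mc3_nodes1
```

Ids that appear only in the edges table keep a row but get NA for every attribute, so expect more missing values in the rebuilt table than in the original one.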
We will then calculate the betweenness and closeness centrality measures.
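A sketch of the centrality step with tidygraph, on a hypothetical four-node path (Person B, Company A, Person C, Company D). The edge columns are renamed to from and to, and node_key is set to id, so the character ids are matched explicitly; the centrality column names are my own.

```r
library(dplyr)
library(tidygraph)

# Toy stand-ins (values are made up).
nodes <- tibble(id = c("Company A", "Person B", "Person C", "Company D"))
edges <- tibble(
  source = c("Company A", "Company A", "Company D"),
  target = c("Person B",  "Person C",  "Person C")
)

mc3_graph <- tbl_graph(
    nodes = nodes,
    edges = edges %>% rename(from = source, to = target),
    directed = FALSE,
    node_key = "id"          # match from/to against the id column
  ) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality   = centrality_closeness())

# Pull the node table back out to inspect the measures.
node_tbl <- mc3_graph %>% activate(nodes) %>% as_tibble()
node_tbl
```

In this toy graph the two middle nodes, Company A and Person C, each sit on two shortest paths between other nodes, while the two end nodes sit on none.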
Now, let’s plot the network graph using the tidygraph and ggraph packages.
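A sketch of a basic network plot with ggraph, again on a hypothetical four-node graph. The layout, the alpha value and the choice to size nodes by betweenness are illustrative assumptions, not necessarily the settings used for the figure on this page.

```r
library(dplyr)
library(tidygraph)
library(ggplot2)
library(ggraph)

# Toy graph (values are made up).
nodes <- tibble(id = c("Company A", "Person B", "Person C", "Company D"))
edges <- tibble(
  from = c("Company A", "Company A", "Company D"),
  to   = c("Person B",  "Person C",  "Person C")
)

mc3_graph <- tbl_graph(nodes = nodes, edges = edges,
                       directed = FALSE, node_key = "id") %>%
  mutate(betweenness_centrality = centrality_betweenness())

set.seed(1234)  # the force-directed layout starts from random positions

p_network <- ggraph(mc3_graph, layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality)) +
  theme_void()

p_network
```

With tens of thousands of nodes, it usually helps to filter the graph first (for example, keeping only nodes above a betweenness threshold) before plotting.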
Next, let’s see what product_services categories the nodes fall into, and whether any of the categories overlap.
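A sketch of how such a frequency table can be produced with dplyr. The toy table is a hypothetical stand-in for the nodes table being counted.

```r
library(dplyr)
library(tibble)

# Toy stand-in (values are made up).
nodes_tbl <- tibble(
  product_services = c(NA, NA, NA, "Unknown", "Unknown",
                       "Seafood products", "Footwear")
)

top_categories <- nodes_tbl %>%
  count(product_services, sort = TRUE) %>%  # rows per category, largest first
  slice_head(n = 10)                        # keep at most the top ten

top_categories
```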
# A tibble: 10 × 2
product_services n
<chr> <int>
1 <NA> 29241
2 character(0) 3811
3 Unknown 2076
4 Fish and seafood products 37
5 Seafood products 24
6 Canning, processing and manufacturing of seafood and other aquatic pro… 18
7 Fish and fish products 18
8 Footwear 17
9 Food products 12
10 Seafood 10